Libraries and utility functions

Load data

We are going to use the Mushroom Classification Dataset, obtained from the UCI Machine Learning Repository. It is available at the following link.
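A minimal loading sketch with pandas. The file name `mushrooms.csv` and the column names shown are assumptions (the dataset stores the class label plus categorical letter codes); here a tiny in-memory sample stands in for the real file.

```python
import io
import pandas as pd

# Tiny stand-in for the real file; in the notebook one would read the
# downloaded CSV directly, e.g.  df = pd.read_csv("mushrooms.csv")
sample_csv = io.StringIO(
    "class,cap-shape,cap-color,odor\n"
    "p,x,n,p\n"
    "e,x,y,a\n"
    "e,b,w,l\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)   # rows x columns of the loaded frame
print(df.head())  # first rows, all values are categorical letter codes
```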

Data pre-processing

Structure exploration

Missing values detection

Before we apply any ML models, we need to examine whether there are missing values in the data.
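One common way to check this is `isnull().sum()`, which counts missing entries per column. The frame below is a synthetic stand-in (one missing value planted on purpose); the real check would run on the loaded dataset.

```python
import pandas as pd

# Hypothetical frame standing in for the mushroom data,
# with one missing value planted for illustration
df = pd.DataFrame({"class": ["p", "e", "e"], "odor": ["p", "a", None]})

# Missing-value count per column
missing = df.isnull().sum()
print(missing)

# Only the columns that actually contain missing values
print(missing[missing > 0])
```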

**Conclusion**: There are no features with missing values in the dataset.

Encoding categorical features

In order to apply the supervised learning algorithms discussed in the Machine Learning course, we need to encode the categorical features as numbers. For this purpose, sklearn.preprocessing.LabelEncoder is going to be leveraged.
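A sketch of the encoding step, assuming one `LabelEncoder` is fitted per column (keeping the fitted encoders around makes the integer codes invertible later). The toy frame is illustrative, not the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical frame in the dataset's style
df = pd.DataFrame({"class": ["p", "e", "e"], "odor": ["p", "a", "l"]})

# Fit one encoder per column so each mapping can be inverted later
encoders = {}
for col in df.columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le

print(df.dtypes)                    # every column is now integer-typed
print(encoders["class"].classes_)   # original labels, sorted: ['e' 'p']
```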

**Conclusion**: All of the data are numerical and can be represented within vector spaces.

Exploratory Data Analysis (EDA)

For visualizations, the libraries plotly and seaborn are going to be used; the same can be achieved using matplotlib.
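As one example of a feature-distribution plot, a seaborn count plot split by class could look like the sketch below. The column names and the output file name are assumptions; the frame is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the encoded/raw mushroom data
df = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor":  ["p", "a", "l", "p"],
})

# Count of each odor value, split by target class
ax = sns.countplot(data=df, x="odor", hue="class")
plt.savefig("odor_counts.png")  # hypothetical output file
```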

Distribution of the features

Pair-plots

Between-feature dependence

Visualization with reduced dimensionality

PCA approach

LDA approach
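A sketch of the LDA projection. One detail worth noting: with a binary target, LDA can produce at most `n_classes - 1 = 1` component, so the result is a 1-D projection rather than a 2-D scatter. The data here is synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)  # binary edible/poisonous stand-in label

# For 2 classes LDA yields at most one discriminant direction
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

print(X_1d.shape)  # (100, 1)
```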

Classification

Before we apply any of the models, we need to divide our dataset into train and test subsets. We are going to leverage sklearn.model_selection.train_test_split, which draws a random split without replacement (not a bootstrap sample). We are going to use 20% of the dataset for testing and 80% for training the models.
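The split described above can be sketched as follows; `random_state` and `stratify` are assumptions added for reproducibility and for keeping the class ratio equal in both subsets, and the arrays are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and target
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# 80/20 random split; stratify preserves the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```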

Naïve Bayes

Model Training

Model Evaluation
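A compact train-and-evaluate sketch for the Naïve Bayes step, assuming `CategoricalNB` since the features are label-encoded categoricals (`GaussianNB` is the other common choice and may be what the notebook uses). The data is synthetic, with the target derived from one feature so the example has signal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 6))  # label-encoded categorical features
y = (X[:, 0] > 1).astype(int)          # synthetic target tied to feature 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

nb = CategoricalNB().fit(X_train, y_train)
pred = nb.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("f1:      ", f1_score(y_test, pred))
```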

Linear Discriminant Analysis (LDA)

Model Training

Model Evaluation
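A sketch of the LDA classifier step, reporting the per-class metrics the summary refers to (precision, recall, F1). Two Gaussian blobs stand in for the real train/test data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
# Two well-separated Gaussian blobs as a stand-in for the encoded data
X0 = rng.normal(0.0, 1.0, size=(100, 4))
X1 = rng.normal(2.0, 1.0, size=(100, 4))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)

# Precision / recall / F1 per class on the fitted data
print(classification_report(y, lda.predict(X)))
```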

Quadratic Discriminant Analysis (QDA)

Model Training

Model Evaluation
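A sketch of the QDA step. Unlike LDA, QDA fits a separate covariance matrix per class, so the synthetic stand-in data gives the two classes different spreads, which is exactly the case QDA is built for.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(2)
# Classes with different covariances: the setting QDA targets
X0 = rng.normal(0.0, 1.0, size=(100, 4))
X1 = rng.normal(2.0, 2.0, size=(100, 4))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

qda = QuadraticDiscriminantAnalysis().fit(X, y)

# Rows: true class, columns: predicted class
print(confusion_matrix(y, qda.predict(X)))
```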

Models summary and conclusion

**Conclusion**: The Naïve Bayes classifier provides the highest accuracy, F1-score and precision; on the other hand, LDA and QDA both have the same recall, which is higher than Naïve Bayes's. The performance achieved by the three models is nearly the same, with Naïve Bayes dominating overall.